Description

[0001] The present invention relates to a microphone array input type speech recognition scheme in which speeches
uttered by a user are inputted through a microphone array and recognized.

[0002] In speech recognition, the surrounding environment under which the speech input is made can largely affect
the recognition performance. In particular, background noises and reflected sounds of the user's speeches can degrade
the recognition performance, so that they are sources of a serious problem encountered in the use of a speech recognition
system. For this reason, in general, a short range microphone designed for use near the mouth of the user, such as a
headset microphone or a hand microphone, has been employed, but it is uncomfortable to wear the headset microphone
on the head for any extended period of time, while the hand microphone can limit the freedom of the user as it occupies
the user's hands, and there has been a demand for a speech input scheme that can allow more freedom to the user.
[0003] A microphone array has been studied as a potential candidate for a speech input scheme that can resolve
the conventionally encountered inconvenience described above, and there are some recent reports of its application
to the speech recognition system. The microphone array is a set of a plurality of microphones which are arranged at
spatially different positions, where noises can be reduced by the synthetic processing of outputs of these microphones.
[0004] Fig. 1 shows a configuration of a conventional speech recognition system using a microphone array. This
speech recognition system of Fig. 1 comprises a speech input unit 11 having a microphone array formed by a plurality
(N sets) of microphones, a sound source direction estimation unit 12, a sound source waveform estimation unit 13, a
speech detection unit 14, a speech analysis unit 15, a pattern matching unit 16, and a recognition dictionary 17.

[0005] In this configuration of Fig. 1, the speech entered at the microphone array is converted into digital signals for
respective microphones by the speech input unit 11, and the speech waveforms of all channels are entered into the 
sound source direction estimation unit 12. 

[0006] At the sound source direction estimation unit 12, a sound source position or direction is estimated from time
differences among signals from different microphones, using the known delay sum array method or a method based
on the cross-correlation function as disclosed in U. Bub, et al.: "Knowing Who to Listen to in Speech Recognition:
Visually Guided Beamforming", ICASSP '95, pp. 848-851, 1995.

[0007] A case of estimating a direction of the sound source and a case of estimating a position of the sound source
respectively correspond to a case in which the sound source is far distanced from the microphone array so that the
incident sound waves can be considered as plane waves and a case in which the sound source is relatively close to
the microphone array so that the sound waves can be considered as propagating in forms of spherical waves.

[0008] Next, the sound source waveform estimation unit 13 focuses the microphone array to the sound source
position or direction obtained by the sound source direction estimation unit 12 by using the delay sum array method, and
estimates the speech waveform of the target sound source.

[0009] Thereafter, similarly as in the usual speech recognition system, the speech analysis is carried out for the
obtained speech waveform by the speech analysis unit 15, and the pattern matching using the recognition dictionary
17 is carried out for the obtained analysis parameter, so as to obtain the recognition result. For a method of pattern
matching, there are several known methods including the HMM (Hidden Markov Model), the multiple similarity method,
and the DP matching, as detailed in Rabiner et al.: "Fundamentals of Speech Recognition", Prentice Hall, for example.
[0010] Now, in the speech recognition system, it is customary to input the speech waveform. For this reason, even in
the conventional speech recognition system using the microphone array as described above, the sound source position
(or the sound source direction) and the speech waveform are obtained by processing the microphone array outputs
according to the delay sum array method, due to a need to estimate the speech waveform by a small amount of
calculations. The delay sum array method is often utilized because the speech waveform can be obtained by a relatively
small amount of calculations, but the delay sum array method is also associated with a problem that the separation
power is lowered when a plurality of sound sources are located close to each other.
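
By way of illustration only, the delay sum array method referred to above can be sketched as follows. This is a minimal sketch, assuming that the arrival delay of each channel is already known as a whole number of samples; the function name `delay_and_sum` and its arguments are hypothetical and are not taken from the cited references.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing channel i by delays[i] samples.

    channels: list of equal-length sample lists, one per microphone.
    delays:   non-negative integer arrival delay (in samples) of each channel
              relative to the earliest channel (assumed known here).
    """
    n_out = len(channels[0]) - max(delays)
    out = []
    for n in range(n_out):
        # Undo each channel's arrival delay so the target source adds coherently.
        acc = sum(ch[n + d] for ch, d in zip(channels, delays))
        out.append(acc / len(channels))
    return out
```

Signals arriving with other delay patterns are not aligned by this operation and are therefore attenuated by the averaging, which is the noise reduction effect described above.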

[0011] On the other hand, as a method for estimating the sound source position (or direction), there is a parametric
method based on a model as disclosed in S. V. Pillai: "Array Signal Processing", Springer-Verlag, New York, 1989, for
example, which is presumably capable of estimating the sound source position at higher precision than the delay sum
array method, and which is also capable of obtaining the power spectrum necessary for the speech recognition from
the sound source position estimation processing at the same time.

[0012] Fig. 2 shows a processing configuration for this conventionally proposed parametric method. In the
configuration of Fig. 2, signals from a plurality of microphones are entered at a speech input unit 21, and the frequency analysis
based on the FFT (Fast Fourier Transform) is carried out at a frequency analysis unit 22. Then, the sound source
position estimation processing is carried out for each frequency component at a power estimation unit 23, and the final
sound source position estimation result is obtained by synthesizing the estimation results for all the frequencies at a
sound source direction judgment unit 24.

[0013] Here, the sound source position estimation processing is a processing for estimating a power at each direction
or position while minutely changing a direction or position over a range in which the sound source can possibly be
located, so that a very large amount of calculations are required. In particular, in a case of assuming the propagation
of sound waves in forms of spherical waves, what is to be estimated is a position of the sound source rather than an arriving
direction of the sound waves, so that two- or three-dimensional scanning is necessary and consequently an enormous
amount of calculations are required.
[0014] Moreover, in the conventionally proposed parametric method described above, it is necessary to carry out
this scanning processing for each frequency component obtained by the fast Fourier transform of the speech, so that
it is difficult to reduce a required amount of calculations.

[0015] The document "Multi-microphone correlation based processing for robust recognition", by Thomas M. Sullivan
and Richard M. Stern, IEEE Speech processing, Minneapolis, 1993, pages II-91 to II-94, discloses a method of signal
processing for robust speech recognition using multiple microphones. Signals from each microphone are entered
through a bank of band-pass filters, and the output of the band-pass filters is then entered into non-linear rectifiers for
rectifying the shape of the entered signal. The outputs from each rectifier are then cross-correlated within each
frequency band.

[0016] It is therefore an object of the present invention to provide a method and a system for microphone array input
type speech recognition capable of realizing a high precision sound source position or direction estimation by a small
amount of calculations, and thereby realizing a high precision speech recognition. This object is achieved by obtaining
a band-pass waveform, which is a waveform for each frequency bandwidth, from input signals of the microphone array,
and directly obtaining a band-pass power of the sound source from the band-pass waveform. Then, the obtained
band-pass power can be used as the speech parameter.

[0017] It is another object of the present invention to provide a method and a system for microphone array input type
speech recognition capable of realizing the sound source estimation and the band-pass power estimation at high
precision while further reducing an amount of calculations. This object is achieved by utilizing a sound source position
search processing in which a low resolution position estimation and a high resolution position estimation are combined.
[0018] These objects are solved by the system of claim 1 and the method of claim 12. Advantageous embodiments
are described in the dependent claims.

[0019] According to one aspect of the present application there is provided a microphone array input type speech
recognition system, comprising: a speech input unit for inputting speeches in a plurality of channels using a microphone
array formed by a plurality of microphones; a frequency analysis unit for analyzing an input speech of each channel
inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform
being a waveform for each frequency bandwidth; a sound source position search unit for calculating a band-pass power
distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by
the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency
bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution; a speech
parameter extraction unit for extracting a speech parameter for speech recognition, from the band-pass power
distribution for each frequency bandwidth calculated by the sound source position search unit, according to the sound source
position or direction estimated by the sound source position search unit; and a speech recognition unit for obtaining a
speech recognition result by matching the speech parameter extracted by the speech parameter extraction unit with
a recognition dictionary.

[0020] According to another aspect of the present application there is provided a microphone array input type speech
analysis system, comprising: a speech input unit for inputting speeches in a plurality of channels using a microphone
array formed by a plurality of microphones; a frequency analysis unit for analyzing an input speech of each channel
inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform
being a waveform for each frequency bandwidth; a sound source position search unit for calculating a band-pass power
distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by
the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency
bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution; and a
speech parameter extraction unit for extracting a speech parameter from the band-pass power distribution for each
frequency bandwidth estimated by the sound source position search unit, according to the sound source position or
direction estimated by the sound source position search unit.

[0021] According to another aspect of the present application there is provided a microphone array input type speech
analysis system, comprising: a speech input unit for inputting speeches in a plurality of channels using a microphone
array formed by a plurality of microphones; a frequency analysis unit for analyzing an input speech of each channel
inputted by the speech input unit, and obtaining band-pass waveforms for each channel, each band-pass waveform
being a waveform for each frequency bandwidth; and a sound source position search unit for calculating a band-pass
power distribution for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained
by the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of frequency
bandwidths, and estimating a sound source position or direction from a synthesized band-pass power distribution.
[0022] According to another aspect of the present application there is provided a microphone array input type speech
recognition method, comprising the steps of: inputting speeches in a plurality of channels using a microphone array
formed by a plurality of microphones; analyzing an input speech of each channel inputted by the inputting step, and
obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency
bandwidth; calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms
for each frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions
for a plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized
band-pass power distribution; extracting a speech parameter for speech recognition, from the band-pass power distribution
for each frequency bandwidth calculated by the calculating step, according to the sound source position or direction
estimated by the calculating step; and obtaining a speech recognition result by matching the speech parameter
extracted by the extracting step with a recognition dictionary.

[0023] According to another aspect of the present application there is provided a microphone array input type speech
analysis method, comprising the steps of: inputting speeches in a plurality of channels using a microphone array formed
by a plurality of microphones; analyzing an input speech of each channel inputted by the inputting step, and obtaining
band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;
calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each
frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions for a
plurality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass
power distribution; and extracting a speech parameter from the band-pass power distribution for each frequency
bandwidth calculated by the calculating step, according to the sound source position or direction estimated by the calculating
step.

[0024] According to another aspect of the present application there is provided a microphone array input type speech
analysis method, comprising the steps of: inputting speeches in a plurality of channels using a microphone array formed
by a plurality of microphones; analyzing an input speech of each channel inputted by the inputting step, and obtaining 
band-pass waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth; 
and calculating a band-pass power distribution for each frequency bandwidth from the band-pass waveforms for each 
frequency bandwidth obtained by the analyzing step, synthesizing calculated band-pass power distributions for a plu- 
rality of frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass 
power distribution. 

[0025] Other features and advantages of the present invention will become apparent from the following description 
taken in conjunction with the accompanying drawings. 

[0026] Fig. 1 is a block diagram of a conventional microphone array input type speech recognition system.
[0027] Fig. 2 is a block diagram of a processing configuration for a conventionally proposed parametric method for 
estimating the sound source position or direction. 

[0028] Fig. 3 is a block diagram of a microphone array input type speech recognition system according to the first
embodiment of the present invention. 

[0029] Fig. 4 is a diagram of a filter function to be used in a sound source position search unit in the system of Fig. 3. 
[0030] Figs. 5A and 5B are diagrams showing a relationship between a sound source position and microphone 
positions in the system of Fig. 3, for a case of direction estimation and for a case of position estimation, respectively.
[0031] Fig. 6 is a diagram for explaining a peak detection from a sound source power distribution after the synthesizing 
processing in the system of Fig. 3. 

[0032] Fig. 7 is a block diagram of one exemplary configuration for a speech recognition unit in the system of Fig. 3. 
[0033] Fig. 8 is a block diagram of another exemplary configuration for a speech recognition unit in the system of 
Fig. 3. 

[0034] Fig. 9 is a flow chart for the overall processing in the system of Fig. 3. 

[0035] Fig. 10 is a diagram for explaining read-out waveform data for one frame used in a calculation of a correlation
matrix at a sound source position search unit in the system of Fig. 3. 

[0036] Figs. 11A and 11B are diagrams showing a relationship between a resolution and an increment value for
search in a sound source position estimation in the system of Fig. 3. 

[0037] Fig. 12 is a block diagram of a sound source position search unit in the system of Fig. 3 according to the 
second embodiment of the present invention. 

[0038] Fig. 13 is a flow chart for the processing of the sound source position search unit of Fig. 12. 

[0039] Referring now to Fig. 3 to Fig. 10, the first embodiment of a method and a system for microphone array input
type speech recognition according to the present invention will be described in detail.

[0040] Fig. 3 shows a basic configuration of a microphone array input type speech recognition system in this first
embodiment. This speech recognition system of Fig. 3 comprises a speech input unit 1, a frequency analysis unit 2, a
sound source position search unit 3, a speech parameter extraction unit 4, and a speech recognition unit 5.
[0041] The speech input unit 1 has a microphone array (not shown) formed by N sets (8 sets, for example) of
microphones, and converts speeches entered from the microphone array into digital signals.
[0042] The frequency analysis unit 2 analyzes the input speech for each microphone (channel) entered at the speech
input unit 1 by a band-pass filter bank (a group of band-pass filters), and obtains a band-pass waveform which is a
waveform for each frequency bandwidth.

[0043] The sound source position search unit 3 estimates a power arriving from each position or direction for each
bandwidth according to the band-pass waveform for each frequency bandwidth obtained for each channel by the
frequency analysis unit 2, as a sound source position judgement information, and identifies a sound source direction or
position by synthesizing the obtained sound source position judgement information for a plurality of frequency
bandwidths.

[0044] The speech parameter extraction unit 4 extracts the band-pass power of the speech signals arrived from the
sound source direction or position identified by the sound source position search unit 3, as a speech parameter,
according to the sound source position judgement information obtained at the sound source position search unit 3.
[0045] The speech recognition unit 5 carries out the speech recognition by matching the speech parameter extracted
by the speech parameter extraction unit 4 with a recognition dictionary.

[0046] Now, the outline of the overall operation in the speech recognition system of Fig. 3 will be described. 

[0047] First, the speeches entered at N (=8) sets of microphones are AD converted at a sampling frequency such
as 12 kHz, for example, microphone channel by microphone channel, at the speech input unit 1. Then, the frequency
analysis is carried out at the frequency analysis unit 2, to obtain band-pass waveforms for a plurality of bands
(bandwidths) corresponding to microphones. Here, a number M of bands used in the analysis is assumed to be equal to 16.
The transmission bandwidths of the band-pass filters are to be determined as those required at the speech recognition
unit 5. A manner of constructing the band-pass filters is well known and can be found in Rabiner et al.: "Fundamentals
of Speech Recognition", Prentice Hall, for example.

[0048] Next, at the sound source position search unit 3, an arriving power in each bandwidth is estimated for each
position or direction, according to the band-pass waveforms for N (=8) channels in each bandwidth outputted by the
frequency analysis unit 2, as the sound source position judgement information. This processing is repeated for M (=16)
times. This calculation of the sound source position judgement information is a calculation of an arriving power while
sequentially displacing an assumed sound source position or direction, so as to obtain a distribution of arriving powers
over a range in which the sound source can be located.

[0049] Thereafter, the sound wave arriving direction or the sound source position is estimated by synthesizing the
above described power distribution obtained for each of M frequency bandwidths. Here, a position or direction with a
large value in the power distribution is to be estimated as that of the sound source.

[0050] In addition, at the speech parameter extraction unit 4, the sound source power (band-pass power) at the
sound wave arriving direction or the sound source position is extracted, from the sound source position judgement
information estimated for each bandwidth at the sound source position search unit 3, as the speech parameter. This
speech parameter is then given to the speech recognition unit 5, where the speech recognition result is obtained and
outputted.

[0051] As described, in this first embodiment, the sound source position is determined according to the estimated
power distribution for each frequency bandwidth to be used in the speech recognition, and the speech parameter is
obtained according to the determined sound source position, so that even when the sound source position is unknown,
it is possible to realize the speech recognition by directly obtaining the speech parameter at high precision with a small
amount of calculations.

[0052] Note that, when the sound source position is known, it suffices to obtain the arriving power value by limiting 
a power distribution calculation range to a single known sound source position or direction, and the configuration of 
Fig. 3 is also applicable to this case without any change. This simplified operation is effective when it is possible to 
assume that a user makes a speech input by approaching to a specific location. 

[0053] Next, the detailed operation for obtaining the power distribution from a plurality of band-pass waveforms at
the sound source position search unit 3 will be described. 

[0054] At the sound source position search unit 3, in order to obtain a power at each direction or position from a
plurality (M sets) of band-pass waveforms, the calculation of the minimum variance method is carried out. The minimum
variance method is well known and described in Haykin: "Adaptive Filter Theory", Prentice Hall, for example.

[0055] In this first embodiment, at a time of the sound source power estimation by the minimum variance method,
in order to deal with signals having a certain bandwidth which is not a narrow bandwidth, a filter function as indicated
in Fig. 4 is realized by calculation, where the band-pass waveforms of the same frequency bandwidth obtained for N
sets of microphones (i = 1 to N) by the frequency analysis unit 2 are added by an adder 32 after passing through
transversal filters 31-1 to 31-N with multiple delay line taps corresponding to N sets of microphones (i = 1 to N). Here,
the filter coefficients $w_{11}$ to $w_{1J}$, ..., $w_{N1}$ to $w_{NJ}$ of the filters 31-1 to 31-N are switchably set up for each bandwidth,
so as to realize the filter function for all bandwidths.

[0056] In the configuration of Fig. 4, a number of taps in the filter is denoted as J, and a filter coefficient of the i-th
microphone (microphone No. i) is denoted as $w_{ij}$ ($1 \le i \le N$, $1 \le j \le J$). Here, J is equal to 10, for example, but this
setting may be changed depending on a width of the bandwidth.
[0057] The filter output y in this configuration of Fig. 4 can be expressed as follows. 

[0058] First, by denoting the band-pass waveform of the k-th frequency bandwidth ($1 \le k \le M$) obtained from the
waveform for the i-th microphone as $x_{ik}(n)$, and by arranging band-pass waveform sample sequences:
$x_{ik} = (x_{ik}(n-J+1), x_{ik}(n-J+2), \ldots, x_{ik}(n-1), x_{ik}(n))$, from J samples past of a certain time n up to this time n,
for all N sets of microphones, it is possible to obtain a vector given by the following equation (1).

$x_k = (x_{1k}, x_{2k}, \ldots, x_{Nk})^{T}$ (1)

[0059] Also, by arranging the filter coefficients $w_{ij}$, it is possible to obtain a vector given by the following equation (2).

$w_k = (w_{11}, w_{12}, \ldots, w_{1J}, w_{21}, w_{22}, \ldots, w_{2J}, \ldots, w_{N1}, w_{N2}, \ldots, w_{NJ})^{T}$ (2)

[0060] Using the above equations (1) and (2), the filter output y can be expressed as:

$y = w_k^{*} x_k$ (3)

where $*$ denotes a complex conjugate transpose of a vector. In this expression, $x_k$ is usually called a snap-shot.
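
As an illustration of equations (1) to (3), the following is a minimal sketch of how a snap-shot may be assembled and passed through the filter. The function names `snapshot` and `filter_output`, and the representation of each band-pass waveform as a Python list, are assumptions made only for this example.

```python
def snapshot(band_waveforms, n, J):
    """Stack the J most recent samples up to time n of every microphone (equation (1)).

    band_waveforms: one band-pass waveform (list of samples) per microphone,
                    all for the same frequency bandwidth k. Requires n >= J - 1.
    """
    x = []
    for x_i in band_waveforms:
        x.extend(x_i[n - J + 1 : n + 1])   # x_i(n-J+1), ..., x_i(n)
    return x


def filter_output(w, x):
    """Equation (3): y = w^* x, with * read as the conjugate transpose."""
    return sum(complex(w_p).conjugate() * x_p for w_p, x_p in zip(w, x))
```

With N microphones and J taps the snap-shot has N x J entries, ordered in the same way as the coefficient vector of equation (2).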
[0061] Now, denoting the expectation value as E[ ], the expectation value of the filter output power $y^2$ is expressed as:

$E[y^2] = E[w_k^{*} x_k x_k^{*} w_k] = w_k^{*} R_k w_k$ (4)

where $R_k = E[x_k x_k^{*}]$ is a correlation matrix of $x_k$. Then, the estimation vector according to the minimum variance method
is obtained by minimizing this expectation value $E[y^2]$ under the constraint conditions that a response of the microphone
array for a target direction or position is to be maintained constant.
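
As an illustration of equation (4), the following minimal sketch replaces the expectation by an average over a set of snap-shots, which is an assumption of this example rather than a statement taken from the text above. The function names are hypothetical.

```python
def correlation_matrix(snapshots):
    """Approximate R = E[x x^*] by averaging outer products over the snap-shots."""
    dim = len(snapshots[0])
    R = [[0j] * dim for _ in range(dim)]
    for x in snapshots:
        for p in range(dim):
            for q in range(dim):
                R[p][q] += x[p] * complex(x[q]).conjugate()
    return [[v / len(snapshots) for v in row] for row in R]


def output_power(w, R):
    """Equation (4): E[y^2] = w^* R w (real-valued for a Hermitian R)."""
    dim = len(w)
    total = sum(complex(w[p]).conjugate() * R[p][q] * w[q]
                for p in range(dim) for q in range(dim))
    return total.real
```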
[0062] These constraint conditions can be expressed as:

$w_k^{*} A = g$ (5)

where g is a column vector with constant values in a size equal to a number L of constraint conditions. For example,
this g can be $[1, 1, \ldots, 1]$. Also, A is a matrix formed by direction control vectors $a_m$ for different frequencies as column
vectors. This matrix A can be expressed as:



$A = [a_1, a_2, \ldots, a_L]$ (6)

and each direction control vector $a_m$ ($m = 1, 2, \ldots, L$) can be expressed as:

$a_m = (1, a_2 e^{j\omega_m \tau_2}, \ldots, a_N e^{j\omega_m \tau_N})$ (7)

where $\tau_2, \ldots, \tau_N$ are propagation time differences of the incident sound wave for the second to N-th microphones with
reference to the first microphone, respectively. Note that the propagation time difference $\tau_1$ of the incident sound wave
for the first microphone is set equal to zero. Also, $\omega_m$ is an angular frequency, and $a_2, \ldots, a_N$ are amplitude ratios of
the incident sound wave for the second to N-th microphones with reference to the first microphone, respectively. Here,
L is set equal to 10, for example, and $\omega_m$ is set to be $\omega_m = ((\omega_a - \omega_b)/(L-1)) \cdot m + \omega_b$, where $\omega_a$ is an upper limit angular
frequency of the bandwidth and $\omega_b$ is a lower limit angular frequency of the bandwidth.
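
As an illustration of equations (6) and (7), the following minimal sketch builds the direction control vectors and arranges them as the columns of A. The positive sign of the exponent follows equation (7) as read here and should be treated as an assumption, as should the function names; the range of m is left to the caller.

```python
import cmath


def omega_m(m, omega_a, omega_b, L):
    """Angular frequency of the m-th constraint, following the formula in the text."""
    return (omega_a - omega_b) / (L - 1) * m + omega_b


def direction_control_vector(omega, taus, amps):
    """Equation (7): (1, a_2 e^{j omega tau_2}, ..., a_N e^{j omega tau_N}).

    taus[0] is expected to be 0 and amps[0] to be 1, because the first
    microphone is the reference.
    """
    return [a * cmath.exp(1j * omega * tau) for a, tau in zip(amps, taus)]


def constraint_matrix(omegas, taus, amps):
    """Equation (6): the vectors a_m as columns, so A[i][m] belongs to microphone i."""
    columns = [direction_control_vector(om, taus, amps) for om in omegas]
    return [list(row) for row in zip(*columns)]
```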

[0063] When the problem of minimization under the constraints given by the equations (4) and (5) is solved by the
Lagrange's method of indeterminate coefficients, the filter coefficient $w_k$ for minimizing the arriving power from any
direction or position other than the sound source direction $\theta$ or the sound source position $\theta$ can be given by the following
equation (8).

$w_k = R_k^{-1} A (A^{*} R_k^{-1} A)^{-1} g$ (8)

[0064] Using this filter coefficient $w_k$, the arriving power (arriving band-pass power, sound source power) $P_k(\theta)$ for
the k-th bandwidth from the sound source $\theta$ can be calculated as the following equation (9).

$P_k(\theta) = g^{*} (A^{*} R_k^{-1} A)^{-1} g$ (9)
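
The step from equation (8) to equation (9) can be checked as follows. This is a sketch, assuming that the correlation matrix $R_k$ is Hermitian and invertible and that $*$ is read as the conjugate transpose; substituting equation (8) into equation (4) gives:

```latex
\begin{aligned}
w_k^{*} R_k w_k
 &= g^{*} (A^{*} R_k^{-1} A)^{-1} A^{*} R_k^{-1} \, R_k \, R_k^{-1} A (A^{*} R_k^{-1} A)^{-1} g \\
 &= g^{*} (A^{*} R_k^{-1} A)^{-1} (A^{*} R_k^{-1} A) (A^{*} R_k^{-1} A)^{-1} g \\
 &= g^{*} (A^{*} R_k^{-1} A)^{-1} g = P_k(\theta).
\end{aligned}
```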



In a case of the sound source position estimation, $\theta$ is taken as a vector for expressing the coordinates.
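
As an illustration of equation (9), the following minimal sketch evaluates the arriving power for one assumed direction or position in plain Python. The helper `solve` (Gauss-Jordan elimination with partial pivoting) and the name `mv_power` are hypothetical; the matrix A is assumed to have as many rows as the correlation matrix R.

```python
def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting (complex entries)."""
    n = len(M)
    aug = [[complex(v) for v in M[r]] + [complex(v) for v in B[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mv_power(R, A, g):
    """Equation (9): P = g^* (A^* R^{-1} A)^{-1} g."""
    n, L = len(A), len(A[0])
    RinvA = solve(R, A)                                    # R^{-1} A, size n x L
    G = [[sum(complex(A[i][p]).conjugate() * RinvA[i][q] for i in range(n))
          for q in range(L)] for p in range(L)]            # A^* R^{-1} A, size L x L
    Ginv_g = solve(G, [[g_p] for g_p in g])                # (A^* R^{-1} A)^{-1} g
    return sum(complex(g[p]).conjugate() * Ginv_g[p][0] for p in range(L)).real
```

The filter coefficient of equation (8) is never formed explicitly in this sketch; only the quadratic form of equation (9) is evaluated.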
[0065] Now, with reference to Figs. 5A and 5B, a manner of obtaining the propagation time difference and the
amplitude for each microphone will be described. Here, the explanation is given on a two-dimensional plane for the sake
of simplicity, but the extension to the three-dimensional space should be obvious.

[0066] First, as shown in Fig. 5A, the coordinates of the first microphone (No. 1) are denoted as $(x_1, y_1)$ and the
coordinates of the i-th microphone (No. i) are denoted as $(x_i, y_i)$. Then, for a case of the plane waves, when the sound
waves are incident from a direction $\theta$, the propagation time difference $\tau_i$ of the incident sound waves at the i-th
microphone and the first microphone is given by:

$\tau_i(\theta) = ((x_i - x_1)^2 + (y_i - y_1)^2)^{1/2} \cos(\theta - \tan^{-1}((y_i - y_1)/(x_i - x_1)))$ (10)

and the amplitude can be assumed as:

$a_1 = a_2 = \cdots = a_N = 1$ (11)
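
As an illustration of equations (10) and (11), the following minimal sketch computes the plane wave quantities for every microphone. Two points are assumptions of this example and not statements of the text: the path difference of equation (10) is divided by the speed of sound c so that the result is a time, and `math.atan2` is used in place of $\tan^{-1}$ so that the quadrant is preserved. The function name and the default value of c are hypothetical.

```python
import math


def plane_wave_delays(mics, theta, c=340.0):
    """Propagation time differences and amplitudes relative to mics[0].

    mics:  list of (x, y) microphone coordinates in metres.
    theta: assumed arriving direction in radians.
    c:     speed of sound in m/s (assumed value).
    """
    x1, y1 = mics[0]
    taus = []
    for xi, yi in mics:
        dist = math.hypot(xi - x1, yi - y1)
        angle = math.atan2(yi - y1, xi - x1)
        taus.append(dist * math.cos(theta - angle) / c)   # equation (10), divided by c
    amps = [1.0] * len(mics)                              # equation (11)
    return taus, amps
```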



[0067] On the other hand, in a case of a point sound source, as shown in Fig. 5B, when an assumed sound source
position $\theta$ is located at $(x_s, y_s)$, the propagation time difference $\tau_i$ and the amplitude $a_i$ can be given by:

$\tau_i = (((x_i - x_s)^2 + (y_i - y_s)^2)^{1/2} - ((x_1 - x_s)^2 + (y_1 - y_s)^2)^{1/2})/c$ (12)

and

$a_i = ((x_1 - x_s)^2 + (y_1 - y_s)^2)^{1/2} / ((x_i - x_s)^2 + (y_i - y_s)^2)^{1/2}$ (13)

where c is the speed of sound.
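
As an illustration of equations (12) and (13), the following minimal sketch computes the propagation time difference and the amplitude ratio of every microphone for an assumed point sound source. The function name and the default value of c are assumptions of this example.

```python
import math


def point_source_delays(mics, source, c=340.0):
    """Time differences and amplitude ratios relative to mics[0] for a point source.

    mics:   list of (x, y) microphone coordinates in metres.
    source: assumed sound source position (x_s, y_s) in metres.
    c:      speed of sound in m/s (assumed value).
    """
    xs, ys = source
    r = [math.hypot(x - xs, y - ys) for x, y in mics]   # distance of each microphone
    taus = [(r_i - r[0]) / c for r_i in r]              # equation (12)
    amps = [r[0] / r_i for r_i in r]                    # equation (13)
    return taus, amps
```

For example, with the first microphone 4 m and the second microphone 5 m from the assumed source, the second microphone receives the wave later by (5 - 4)/c and with an amplitude ratio of 4/5.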

[0068] $P_k(\theta)$ given by the above equation (9) becomes large when $\theta$ coincides with the arriving direction or the sound
source position, or small when they do not coincide. For this reason, by calculating $P_k(\theta)$ for each direction or position,
the arriving direction or the sound source position can be estimated as a position of the peak.

[0069] To this end, in a case of obtaining the sound source direction, the sound source position search unit 3
calculates $P_k(\theta)$ while sequentially changing $\theta$ gradually, 1° by 1° for example. Also, in a case of obtaining the sound source
position, the sound source position search unit 3 calculates $P_k(\theta)$ for lattice points at 2 cm interval for example, within
the search range. The increment value for $\theta$ may be changed to any appropriate value depending on factors such as
a wavelength and a distance to an assumed sound source position.

[0070] Next, at the sound source position search unit 3, $P_k(\theta)$ of the equation (9) obtained for all bandwidths are
synthesized to estimate the sound source position or the sound source direction.

[0071] Here, the synthesizing can be realized by multiplying a weight to the arriving power distribution $P_k(\theta)$ for
each bandwidth, and taking a sum for all the frequency bandwidths from k = 1 to k = M, that is:

$P(\theta)_{total} = \sum_{k=1}^{M} W_k P_k(\theta)$ (14)

and estimating the sound source from a peak on the distribution after this synthesizing processing (the total sound
source power distribution).
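
As an illustration of equation (14), the following minimal sketch forms the total sound source power distribution from the per-bandwidth distributions and picks the single largest value. The representation of each distribution as a list over the candidate directions or positions, and the function names, are assumptions of this example.

```python
def synthesize(power_by_band, weights=None):
    """Equation (14): weighted sum over the M bandwidths at every candidate theta.

    power_by_band: M lists, where power_by_band[k][t] is P_k at candidate t.
    weights:       M weights W_k; all equal to 1 when omitted.
    """
    M = len(power_by_band)
    if weights is None:
        weights = [1.0] * M
    n_theta = len(power_by_band[0])
    return [sum(weights[k] * power_by_band[k][t] for k in range(M))
            for t in range(n_theta)]


def strongest_candidate(total):
    """Index of the largest value of the total sound source power distribution."""
    return max(range(len(total)), key=lambda t: total[t])
```

Setting a weight to a small value suppresses the contribution of the corresponding bandwidth to the total distribution.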

[0072] Here, all the weights $W_k$ may be set equal to 1, or the weight for the frequency of a noise source with a known
frequency characteristic such as a power source frequency may be set small so as to reduce the influence of the noise.
[0073] The detection of the sound source is carried out according to a size of a peak in $P(\theta)_{total}$ as described above,
and a single largest peak can be detected as the sound source. Alternatively, as shown in Fig. 6, a prescribed
threshold, such as 5 dB, may be set with reference to an average value of portions other than peak portions on the
synthesized (total) sound source power distribution, and all peaks above this threshold may be detected as the sound sources,
while not detecting any sound source at all when there is no peak above this threshold.
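
As an illustration of the threshold-based detection described above, the following minimal sketch detects all peaks that exceed the average of the non-peak portions by a prescribed number of decibels. Treating a peak as a single local maximum, and taking every other sample as a non-peak portion, are assumptions of this example; the function name is hypothetical.

```python
def detect_sources(total, threshold_db=5.0):
    """Return the indices of peaks lying threshold_db above the non-peak average."""
    peaks = [t for t in range(1, len(total) - 1)
             if total[t] > total[t - 1] and total[t] >= total[t + 1]]
    rest = [p for t, p in enumerate(total) if t not in peaks]
    if not peaks or not rest:
        return []
    floor = sum(rest) / len(rest)                   # average of the non-peak portions
    limit = floor * 10.0 ** (threshold_db / 10.0)   # threshold expressed as a power ratio
    return [t for t in peaks if total[t] > limit]
```

An empty list corresponds to the case in which no sound source is detected at all.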

[0074] In this manner, the arriving power distribution $P_k(\theta)$ for each bandwidth given by the equation (9) is used in
judging whether the sound source exists at an assumed sound source position which is set to be each direction or
position determined with reference to positions of a plurality of microphones, so that this arriving power distribution
$P_k(\theta)$ will be referred to as the sound source position judgement information.

[0075] Next, the speech parameter extraction unit 4 can extract the power of the k-th frequency bandwidth of the
sound source, from the already obtained arriving power distribution P_k(θ) for each bandwidth, according to the sound
source direction or position obtained by the sound source position search unit 3. Consequently, by extracting the power
for all the bandwidths from k = 1 to k = M, it is possible to obtain the band-pass power to be used as the speech
parameter.
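
As a small illustration of this extraction, the speech parameter is simply the value of each band's arriving power distribution at the grid point chosen as the sound source. The function name and the list-of-lists layout (one list of powers per band, indexed by grid point) are assumptions made for this sketch.

```python
def extract_speech_parameter(power_by_band, source_index):
    """Return [P_1(theta_0), ..., P_M(theta_0)] for the chosen grid point."""
    return [band[source_index] for band in power_by_band]


# Two bands evaluated on three candidate directions; the source is at index 1.
power_by_band = [
    [0.1, 0.9, 0.2],
    [0.2, 0.7, 0.1],
]
band_pass_power = extract_speech_parameter(power_by_band, 1)
```

No additional spectrum analysis is needed at this stage, because the per-band powers were already computed during the search.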

[0076] The band-pass power of the sound source obtained in this manner is sent from the speech parameter extrac- 
tion unit 4 to the speech recognition unit 5, and used in the speech recognition processing. 
[0077] As shown in Fig. 7, the speech recognition unit 5 comprises a speech power calculation unit 501, a speech
detection unit 502, a pattern matching unit 503, and a recognition dictionary 504.

[0078] In this speech recognition unit 5, the speech power is calculated from the speech parameter (band-pass
power) extracted by the speech parameter extraction unit 4, and a speech section is detected by the speech detection
unit 502 according to the calculated speech power. Then, for the speech parameter in the detected speech section,
the pattern matching with the recognition dictionary 504 is carried out by the pattern matching unit 503.
[0079] Note that, as shown in Fig. 8, the speech recognition unit 5 may be formed by a pattern matching unit 511, a
recognition dictionary 512, and a speech detection unit 513, so as to carry out the word spotting scheme in which the
continuous pattern matching for the speech parameter is carried out and a section with the largest matching score is
determined as a speech section.

[0080] The total speech power can be obtained by summing powers for all the bandwidths extracted by the speech 
parameter extraction unit 4, so that it is possible to use the known speech section detection method based on the 
speech power as disclosed in L. F. Lamel et al.: "An Improved Endpoint Detector for Isolated Word Recognition", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, No. 4, pp. 777-785, August 1981. The above
processing is to be carried out for each frame of the input speech waveform data so as to recognize the speech 
continuously. 
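
The summation into a total speech power, followed by a power-based speech section detection, can be sketched as below. The detector shown is a deliberately simple fixed-threshold rule with a minimum duration, standing in for the cited endpoint detection method; it is not that method, and the function names, `threshold`, and `min_frames` are hypothetical.

```python
def frame_power(band_powers):
    """Total speech power of one frame: sum of the band-pass powers."""
    return sum(band_powers)


def detect_speech_sections(frame_powers, threshold, min_frames=3):
    """Return (start, end) frame index pairs, end exclusive.

    A section is a run of at least `min_frames` consecutive frames whose
    total power exceeds `threshold`.
    """
    sections = []
    start = None
    for i, p in enumerate(frame_powers):
        if p > threshold:
            if start is None:
                start = i  # a candidate section begins here
        else:
            if start is not None and i - start >= min_frames:
                sections.append((start, i))
            start = None
    if start is not None and len(frame_powers) - start >= min_frames:
        sections.append((start, len(frame_powers)))
    return sections
```

In the flow of the first embodiment, the frame powers would be computed one frame at a time, and the pattern matching would run once the end of a section is found.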

[0081] In this first embodiment, using the configuration of the microphone array input type speech recognition system 
as described above, the speech recognition is realized by directly obtaining the band-pass power which is the speech 
parameter. Consequently, it is possible to use the minimum variance method which is a sound source direction or 
sound source position estimation method with a high precision and a small amount of calculations. 
[0082] Now, the flow of processing as described above will be summarized with reference to the flow chart of Fig. 9. 
[0083] First, prior to the start of the processing, the initial setting is made for factors such as whether a direction
estimation or a position estimation is to be used, a range of sound source search to be used, and an
increment value to be used in the search (step S1). In an example of Fig. 9, a direction estimation is to be used, a
searching range is from -90° to +90°, and an increment value for search is 1°.

[0084] Next, the speeches entered at N sets of microphones are A/D converted at the sampling frequency of 12 kHz
for example, for all N channels in parallel, by the speech input unit 1. The obtained waveform data are then stored in
a buffer (not shown) of the speech input unit 1 (step S2). Normally, this step S2 is carried out continuously in real time, 
regardless of the other processing. 

[0085] Next, the waveform data for each channel is read out from the buffer for one frame size, 256 points for example,
and applied to the band-pass filter bank at the frequency analysis unit 2, so as to extract the band-pass waveform for
each frequency bandwidth k (k = 1 to M), where M = 16 in this example (step S3). Here, the calculations for the
band-pass filters may be carried out independently for each microphone in parallel, or sequentially for each microphone
serially.

[0086] Next, at the sound source position search unit 3, using the band-pass waveform data for N channels obtained
by the frequency analysis unit 2 at the step S3, the correlation matrix for each frequency bandwidth k is obtained
(step S4). Here, as shown in Fig. 10, the calculation of the correlation matrix R_k is realized by obtaining the correlation
matrix as a time average of the auto-correlation matrices of samples for 20 frames taken at an interval of 2 samples
from a frame data of 256 samples (points) for example.
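
One way to read this time averaging is sketched below with NumPy. The sketch assumes that each of the 20 short frames is a stacked vector of `num_taps` consecutive samples from every channel (matching the multi-tap filter configuration), that the channels are stacked one after another, and that `num_taps = 8`; none of these details is fixed by the text above, and the function name is hypothetical.

```python
import numpy as np


def correlation_matrix(band_frame, num_taps=8, num_snapshots=20, hop=2):
    """Time-averaged correlation matrix for one frequency bandwidth.

    band_frame: array of shape (N, T), the band-pass waveform of N
    channels over one frame (T = 256 in the example of the text).
    Returns an (N * num_taps, N * num_taps) matrix.
    """
    band_frame = np.asarray(band_frame)
    n_ch, n_samp = band_frame.shape
    last_start = (num_snapshots - 1) * hop
    if last_start + num_taps > n_samp:
        raise ValueError("frame too short for the requested snapshots")
    dim = n_ch * num_taps
    R = np.zeros((dim, dim), dtype=complex)
    for s in range(num_snapshots):
        t = s * hop  # snapshots start every `hop` samples
        x = band_frame[:, t:t + num_taps].reshape(dim)  # stack the taps
        R += np.outer(x, np.conj(x))  # auto-correlation of this snapshot
    return R / num_snapshots
```

Averaging over several snapshots is what makes the matrix usable for the inversion required in the next step; a single outer product would be rank one.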

[0087] In addition, at the step S4, using this correlation matrix R_k, the arriving power distribution
P_k(θ) = g^H (A^H R_k^{-1} A)^{-1} g is obtained as the sound source position judgement information for each assumed
position or direction. This calculation is carried out over the entire space to be searched through, so as to obtain the
spatial distribution of the arriving powers. As for the bandwidths, when M = 16, the calculation is carried out from
k = 1 to k = 16.
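
The arriving power for one assumed position or direction can be evaluated numerically as in the following sketch. It assumes that the correlation matrix `R`, the matrix of direction control vectors `A`, and the constraint response vector `g` have already been built for that position or direction; the function name is hypothetical, and linear solves are used in place of explicit matrix inverses as a numerical convenience.

```python
import numpy as np


def arriving_power(R, A, g):
    """Evaluate P = g^H (A^H R^-1 A)^-1 g for one candidate direction.

    R: (D, D) correlation matrix, A: (D, L) direction control vectors
    as columns, g: (L,) constraint response vector.
    """
    R = np.asarray(R, dtype=complex)
    A = np.asarray(A, dtype=complex)
    g = np.asarray(g, dtype=complex)
    Rinv_A = np.linalg.solve(R, A)        # R^-1 A without forming R^-1
    M = A.conj().T @ Rinv_A               # A^H R^-1 A, shape (L, L)
    return float(np.real(g.conj() @ np.linalg.solve(M, g)))
```

Repeating this evaluation for every grid point of the search range, and for every bandwidth k, gives the spatial distribution of arriving powers described above.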
[0088] Next, at the sound source position search unit 3, the arriving power distributions P_k(θ) for different frequency
bandwidths are summed over the entire frequency bandwidths, for each θ, so as to obtain the total sound source power
distribution P(θ)_total. Then, the largest peak is extracted from this P(θ)_total and identified as the sound source position
θ_0 (step S5).

[0089] Next, at the speech parameter extraction unit 4, a value on the arriving power distribution (sound source
position judgement information distribution) P_k(θ) for each frequency bandwidth obtained by the sound source position
search unit 3 at the sound source position θ_0 is extracted, and this is repeated for all the frequency bandwidths for
each sound source, so as to obtain the speech parameter P_k(θ_0) (step S6).
[0090] In addition, at the step S6, the powers for different bandwidths k of the speech parameter P_k(θ_0) are summed
to obtain the power for the entire speech bandwidth at the speech power calculation unit 501 of the speech recognition
unit 5.

[0091] Next, using the power for the entire speech bandwidth obtained at the step S6, the speech section is detected
by the speech detection unit 502 of the speech recognition unit 5 (step S7).
[0092] Then, whether the end of the speech section is detected by the speech detection unit 502 or not is judged
(step S8), and if not, the processing returns to the step S2 to carry out the frequency analysis for the next waveform
data frame.

[0093] On the other hand, when the end of the speech section is detected, a matching of the speech parameter in
the detected speech section with the recognition dictionary 504 is carried out, and the obtained recognition result is
outputted (step S9). Then, the processing returns to the step S2 to carry out the frequency analysis for the next
waveform data frame.

[0094] Thereafter, the above processing is repeated so as to carry out the speech parameter estimation and the
speech recognition continuously.

[0095] Note that the processing described above can be carried out at high speed by adopting the pipeline processing
using a plurality of processors (as many processors as a number of microphones, for example) which are operated in
parallel.

[0096] Referring now to Fig. 11A to Fig. 13, the second embodiment of a method and a system for microphone array
input type speech recognition according to the present invention will be described in detail.

[0097] This second embodiment is directed to a scheme for further reducing an amount of calculations in the sound
source position estimation, by changing the constraint conditions used in the spectrum estimation so as to control the
resolution, and changing the search density according to the resolution so as to reduce an amount of calculation in
the sound source search. In this second embodiment, the basic configuration of the speech recognition system is the
same as that of the first embodiment, so that Fig. 3 will be also referred in the following description.
[0098] In the first embodiment described above, the constraint conditions used in the spectrum estimation based on
the minimum variance method are that a response of the microphone array for one direction or position is to be
maintained constant. In this case, the resolution of the estimation is sufficiently high, so that the peak is found by the dense
search in which the arriving power is obtained while changing θ in steps of 1° for example within the sound source search
range.

[0099] When the resolution is so high as in this case, as shown in Fig. 11A, there is a possibility for failing to detect
an accurate apex of the peak when the search is carried out not so densely, so that an amount of calculations for the
search cannot be reduced easily.

[0100] In contrast, when the resolution of the sound source position estimation processing can be lowered, as shown
in Fig. 11B, a possibility for overlooking the peak position can be made low even when the search is carried out coarsely,
so that an amount of calculations can be reduced. In this case, however, it may not be possible to separate the closely
located sound sources, or the estimation precision may be lowered, as much as the resolution is lowered.

[0101] For this reason, this second embodiment adopts a scheme in which the lower resolution search is carried out
first and then the high resolution search is carried out only in a vicinity of the peak, so as to realize the high precision
sound source position estimation with a small amount of calculations. This scheme will now be described in detail.
[0102] The resolution at a time of the sound source position estimation can be controlled by requiring that responses
of the microphone array for a plurality of directions or positions are to be simultaneously maintained constant, instead
of just requiring a response of the microphone array for one direction or position to be maintained constant as in the
constraint conditions of the equation (5).

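[0103] For example, using two time delays τ_i(θ_1) and τ_i(θ_2) for two angles θ_1 and θ_2 (according to the equation (10)),
it is possible to use two direction control vectors a_m(θ_1) and a_m(θ_2) (m = 1, 2, ..., L) given by the following equations
(15) and (16).

$$a_m(\theta_1) = \left(1,\ a_2 e^{-j\omega_m \tau_2(\theta_1)},\ \ldots,\ a_N e^{-j\omega_m \tau_N(\theta_1)}\right) \qquad (15)$$

$$a_m(\theta_2) = \left(1,\ a_2 e^{-j\omega_m \tau_2(\theta_2)},\ \ldots,\ a_N e^{-j\omega_m \tau_N(\theta_2)}\right) \qquad (16)$$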



[0104] Then, using these two direction control vectors a_m(θ_1) and a_m(θ_2), it is possible to set:

$$A = \left[a_1(\theta_1),\ a_2(\theta_1),\ \ldots,\ a_L(\theta_1),\ a_1(\theta_2),\ a_2(\theta_2),\ \ldots,\ a_L(\theta_2)\right] \qquad (17)$$

so as to make responses of the microphone array to two directions simultaneously.

[0105] Here, when θ_1 and θ_2 are set to close values, such as θ_2 = θ_1 + 1° for example, it is equivalent to a case of
making a response of the microphone array for a single direction with a width between θ_1 and θ_2, so that it is equivalent
to the lowering of the resolution. Note here that a number of directions for which the responses of the microphone array
are to be made simultaneously is not necessarily limited to two.

[0106] When the resolution is lowered, the search can be coarser compared with a case of high resolution, so that 
an amount of calculations can be reduced. 

[0107] Then, after the search using the low resolution sound source position estimation processing described above,
the high resolution search as described in the first embodiment can be carried out only in a vicinity of the peak detected 
by the first search, so that the high precision sound source position estimation can be realized with a reduced amount 
of calculations overall. 

[0108] Fig. 12 shows a configuration of the sound source position search unit 3 in this second embodiment for real- 
izing the above described sound source position estimation processing. 

[0109] In this configuration of Fig. 12, the sound source position search unit 3 comprises a low resolution sound 
source position search unit 301 and a high resolution sound source position search unit 302. The low resolution sound 
source position search unit 301 coarsely estimates the arriving power distribution in terms of positions or directions by 
using the low resolution spectrum estimation. The high resolution sound source position search unit 302 densely
estimates the arriving power distribution by using the high resolution spectrum estimation only in a vicinity of the position
or direction obtained by the low resolution sound source position search unit 301.

[0110] Now, the flow of processing in this sound source position search unit 3 in the configuration of Fig. 12 will be
described with reference to the flow chart of Fig. 13. 

[0111] First, using the inputs of the band-pass waveforms corresponding to the microphone, the correlation matrix
is calculated (step S11). Here, a method for obtaining this correlation matrix is the same as in the first embodiment.
[0112] Next, using the obtained correlation matrix R_k, the low resolution sound source position search is carried out
(step S12). At this point, the increment value of θ for search is set to be a relatively large value, such as 5° for example,
so as to carry out the search coarsely over the entire search range. Also, in order to lower the resolution, a matrix as
expressed by the equation (17) which has two direction control vectors a_m(θ_1) and a_m(θ_2) for two directions or positions
as expressed by the equations (15) and (16) as column vectors is used instead of a matrix A in the equation (9). In
Fig. 13, this matrix is denoted as B in order to distinguish it from a matrix A of the equation (6). The search is carried
out for each bandwidth.

[0113] Next, the low resolution arriving power distributions for different bandwidths are synthesized, and the sound
source position θ_0 is obtained from a peak therein (step S13).

[0114] Next, in a vicinity of the sound source position obtained at the step S13, the high resolution sound source
position search is carried out. Here, the search range is set to be ±10° of the sound source position
obtained at the step S13, for example. At this point, the equation to be used for the arriving power estimation (arriving
power distribution) is the same as the equation (9), and the increment value is set to a smaller value such as 1° for
example (step S14).

[0115] Next, the high resolution arriving power distributions for different bandwidths obtained at the step S14 are
synthesized, and the sound source position θ_0' is obtained from a peak therein (step S15).
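
The two-stage search of the steps S12 to S15 can be sketched as follows. The sketch takes two hypothetical callables, `power_low` and `power_high`, each returning the synthesized (total) arriving power at an angle for the low resolution and high resolution estimation respectively; how those powers are computed is outside the sketch. The default step sizes and the ±10° window follow the example values in the text, and the function names are assumptions.

```python
def angle_grid(start, stop, step):
    """Inclusive grid of angles in degrees."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def coarse_to_fine(power_low, power_high, lo=-90.0, hi=90.0,
                   coarse_step=5.0, fine_step=1.0, half_width=10.0):
    """Return (coarse peak, refined peak, number of power evaluations)."""
    # Steps S12-S13: coarse search over the whole range at low resolution.
    coarse = angle_grid(lo, hi, coarse_step)
    theta0 = max(coarse, key=power_low)
    # Steps S14-S15: dense search only near the coarse peak.
    fine = angle_grid(max(lo, theta0 - half_width),
                      min(hi, theta0 + half_width), fine_step)
    theta0_refined = max(fine, key=power_high)
    return theta0, theta0_refined, len(coarse) + len(fine)


# Toy distributions: a broad and a sharp peak, both centred at 23 degrees.
broad = lambda t: 1.0 / (1.0 + ((t - 23.0) / 15.0) ** 2)
sharp = lambda t: 1.0 / (1.0 + (t - 23.0) ** 2)
result = coarse_to_fine(broad, sharp)
```

With these example settings the search evaluates 37 coarse points and 21 fine points, 58 in total, compared with 181 points for a dense 1° search over the whole range from -90° to +90°.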
[0116] At the speech parameter extraction unit 4, the power (speech parameter) of the sound source is extracted
from the arriving power distribution obtained by the high resolution sound source position search at the high resolution
sound source position search unit 302 of the sound source position search unit 3.

[0117] As described, in this second embodiment, by using the sound source position search processing in which the
low resolution sound source position estimation and the high resolution sound source position estimation are combined,
it is possible to estimate the sound source position and its band-pass power while reducing an amount of calculations
considerably.

[0118] As described, according to the present invention, a band-pass waveform which is a waveform for each
frequency bandwidth is obtained from input signals of the microphone array, and a band-pass power of the sound source
is directly obtained from the band-pass waveform, so that it is possible to realize a high precision sound source position
or direction estimation by a small amount of calculations. Moreover, the obtained band-pass power can be used as
the speech parameter so that it is possible to realize a high precision speech recognition.

[0119] In the speech recognition system of the present invention, the input signals of the microphone array entered
by the speech input unit are frequency analyzed by the frequency analysis unit, so as to obtain the band-pass waveform
which is a waveform for each frequency bandwidth. This band-pass waveform is obtained by using the band-pass filter
bank (a group of band-pass filters), instead of using the frequency analysis based on FFT as in the conventional speech
recognition system. Then, the band-pass power of the sound source is directly obtained from the obtained band-pass
waveform by the sound source position search unit.

[0120] Here, in order to handle signals within some bandwidth collectively, a filter configuration (filter function) having
a plurality of delay line taps for each microphone channel is used and the sound source power is obtained as a sum
of the filter outputs for all channels, while using the minimum variance method which is a known high precision spectrum
estimation method.

[0121] The sound source power estimation processing using the minimum variance method is also used in the
conventionally proposed parametric method described above, but a use of only one delay line tap has been assumed
conventionally, so that it has been impossible to obtain the bandwidth power collectively.

[0122] In contrast, in the speech recognition system of the present invention, a filter configuration with a plurality of
delay line taps is used so that the power in each direction or position is obtained for each frequency bandwidth necessary
for the speech recognition, rather than obtaining the power in each direction or position for each frequency, and therefore
the obtained power can be directly used for the speech recognition while a required amount of calculations can be
reduced.

[0123] For example, in a case of using the conventional FFT with 512 points, it has been necessary to repeatedly
obtain the power in each direction for each of 256 components, but in the present invention, when a number of bands
in the band-pass filter bank is set to 16 for example, it suffices to estimate the power in each direction for 16 times. In
addition, this power (band-pass power) can be estimated at higher precision compared with the conventional case of
using the delay sum array processing, so that it is possible to realize the high precision speech recognition.
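
The comparison of the number of power estimations can be worked through as a short calculation. The figure of 181 directions (a range of -90° to +90° searched in 1° steps, as in the example of Fig. 9) is carried over from the first embodiment as an assumption, and the count covers only how many power estimates are made, not the cost of each estimate, which depends on the matrix size and is not modelled here.

```python
def num_power_estimates(num_components, num_directions):
    """Number of per-direction power estimates for one frame."""
    return num_components * num_directions


directions = 181  # -90 to +90 degrees in 1 degree steps (assumed grid)

fft_estimates = num_power_estimates(256, directions)   # per FFT component
bank_estimates = num_power_estimates(16, directions)   # per filter bank band
reduction = fft_estimates // bank_estimates            # factor of 16
```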

[0124] In the sound source position search unit, the synthesizing of the band-pass power distributions for a plurality
of frequency bandwidths can be realized by multiplying respective weights to the band-pass power distributions of
different bandwidths, and taking a sum of the weighted band-pass power distributions for all the frequency bandwidths.
Here, all the weights may be set equal to 1, or the weight for the frequency of a noise source with a known frequency
characteristic such as a power source frequency may be set small so as to reduce the influence of the noise.

[0125] Also, in the sound source position search unit, the estimation of the sound source position or direction from
the synthesized power distribution for each position or direction can be realized by detecting a peak in the synthesized
power distribution and setting a position or direction corresponding to the detected peak as the sound source position
or direction.

[0126] Furthermore, according to the present invention, by using a sound source position search processing in which
a low resolution position estimation and a high resolution position estimation are combined, it is possible to realize the
sound source estimation and the band-pass power estimation at high precision while further reducing an amount of
calculations.

[0127] In this case, the coarse search carried out by the sound source position search unit for the purpose of reducing
an amount of calculations is the low resolution search so that it is possible to make it highly unlikely to overlook the
sound source position or direction (a peak position in the power distribution).
[0128] Although there is a possibility for being unable to separate closely located sound sources or lowering the
estimation precision by the low resolution search alone, the high resolution search is also carried out only in a vicinity
of the sound source position or direction obtained by the low resolution search (in a vicinity of the detected peak), so
that it is possible to realize the high precision sound source position estimation with a further reduced amount of
calculations.

[0129] It is to be noted that, in the above description, the speech recognition system having a speech recognition 
unit has been described, but the speech parameter extraction technique according to the present invention can be 
utilized separately from the speech recognition device. 

[0130] Namely, it is also possible to provide a microphone array input type speech analysis system for analyzing the
input signals of the microphone array and extracting the speech parameter, which is formed by the speech input unit,
the frequency analysis unit, the sound source position search unit, and the speech parameter extraction unit
substantially as described above, in which the band-pass waveform which is a waveform for each frequency bandwidth is
obtained from input signals of the microphone array, a band-pass power of the sound source is directly obtained from
the band-pass waveform, and the obtained band-pass power is used as the speech parameter.

[0131] Similarly, it is also possible to provide a microphone array input type speech analysis system for analyzing
the input signals of the microphone array and estimating the sound source position or direction, which is formed by
the speech input unit, the frequency analysis unit, and the sound source position search unit substantially as described
above, in which the band-pass waveform which is a waveform for each frequency bandwidth is obtained from input
signals of the microphone array, a band-pass power of the sound source is directly obtained from the band-pass
waveform, and the sound source position or direction is estimated according to a synthesized band-pass power distribution.
[0132] Such a microphone array input type speech analysis system according to the present invention is utilizable
not only in the speech recognition but also in the other speech related processing such as a speaker recognition.
[0133] It is also to be noted that, besides those already mentioned above, many modifications and variations of the
above embodiments may be made without departing from the novel and advantageous features of the present invention.
Accordingly, all such modifications and variations are intended to be included within the scope of the appended claims.



Claims 

1. A microphone array input type speech analysis system, comprising:

a speech input unit (1) for inputting speeches in a plurality of channels using a microphone array formed by
a plurality of microphones; and

a frequency analysis unit (2) for analysing an input speech of each channel inputted by the speech input unit,
and obtaining band-pass waveforms for each channel, each band-pass waveform being a waveform for each
frequency bandwidth;

characterized by

a sound source position search unit (3) for calculating a band-pass power distribution as a function of sound source
position or direction for each frequency bandwidth from the band-pass waveforms for each frequency bandwidth
obtained by the frequency analysis unit, synthesizing calculated band-pass power distributions for a plurality of
frequency bandwidths, and estimating a sound source position or direction from a synthesized band-pass power
distribution.

2. The system of claim 1, further comprising:

a speech parameter extraction unit (4) for extracting a speech parameter from the band-pass power distribution
for each frequency bandwidth calculated by the sound source position search unit, according to the sound
source position or direction estimated by the sound source position search unit.

3. The system of claim 2, further comprising:

a speech recognition unit (5) for obtaining a speech recognition result by matching the speech parameter
extracted by the speech parameter extraction unit with a recognition dictionary.

4. The system of any one of claims 1, 2 and 3, wherein the sound source position search unit includes:

a low resolution sound source position estimation unit for estimating a rough sound source position or direction,
by minimizing an output power of the microphone array under constraints that responses of the microphone
array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation unit for estimating an accurate sound source position or
direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound
source position estimation unit, by minimizing the output power of the microphone array under constraints that
a response of the microphone array for one direction or position is to be maintained constant, wherein the
speech parameter extraction unit extracts the speech parameter for speech recognition according to the
accurate sound source position or direction.

5. The system of claim 3, wherein the frequency analysis unit obtains the band-pass waveforms for each channel by
using a band-pass filter bank.

6. The system of claim 3, wherein the sound source position search unit calculates the band-pass power distribution
for each frequency bandwidth, by calculating a band-pass power for each frequency bandwidth, in each one of a
plurality of assumed sound source positions or directions within a prescribed search range.

7. The system of claim 3, wherein the sound source position search unit calculates the band-pass power distribution 
for each frequency bandwidth by using a filter function configuration having a plurality of delay line taps for each 
channel. 

8. The system of claim 3, wherein the sound source position search unit calculates the band-pass power distribution
for each frequency bandwidth by using a minimum variance method for minimizing an output power of the micro- 
phone array under constraints that a response of the microphone array for one direction or position is to be main- 
tained constant. 

9. The system of claim 3, wherein the speech parameter extraction unit extracts the band-pass power distribution 
for each frequency bandwidth calculated by the sound source position search unit for the sound source position 
or direction estimated by the sound source position search unit directly as the speech parameter. 

10. The system of claim 3, wherein the sound source position search unit synthesizes the calculated band-pass power 
distributions for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with 
respective weights, and summing weighted band-pass power distributions. 

11. The system of claim 3, wherein the sound source position search unit estimates the sound source position or
direction by detecting a peak in the synthesized band-pass power distribution and setting a position or direction 
corresponding to a detected peak as the sound source position or direction. 

12. A microphone array input type speech analysis method, comprising the steps of:

inputting speeches in a plurality of channels using a microphone array formed by a plurality of microphones; and

analysing an input speech of each channel inputted by the inputting step, and obtaining band-pass waveforms
for each channel, each band-pass waveform being a waveform for each frequency bandwidth;

characterized by 

calculating a band-pass power distribution as a function of sound source position or direction for each frequency 
bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the analysing step, synthe- 
sizing calculated band-pass power distributions for a plurality of frequency bandwidths, and estimating a sound 
source position or direction from a synthesized band-pass power distribution. 

13. The method of claim 12, further comprising the step of: 

extracting a speech parameter from the band-pass power distribution for each frequency bandwidth calculated 
by the calculating step, according to the sound source position or direction estimated by the calculating step. 

14. The method of claim 13, further comprising the step of: 

obtaining a speech recognition result by matching the speech parameter extracted by the extracting step with 
a recognition dictionary.

15. The method of any one of claims 12, 13 and 14, wherein the calculating step includes the steps of:

a low resolution sound source position estimation step for estimating a rough sound source position or direction,
by minimizing an output power of the microphone array under constraints that responses of the microphone
array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation step for estimating an accurate sound source position or
direction in a vicinity of the rough sound source position or direction estimated by the low resolution sound
source position estimation step, by minimizing the output power of the microphone array under constraints
that a response of the microphone array for one direction or position is to be maintained constant, wherein
the extracting step extracts the speech parameter for speech recognition according to the accurate sound
source position or direction.

16. The method of claim 14, wherein the analysing step obtains the band-pass waveforms for each channel by using
a band-pass filter bank.

17. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each fre- 
quency bandwidth, by calculating a band-pass power for each frequency bandwidth, in each one of a plurality of 
assumed sound source positions or directions within a prescribed search range. 

18. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each fre- 
quency bandwidth by using a filter function configuration having a plurality of delay line taps for each channel. 

19. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by using a minimum variance method for minimizing an output power of the microphone array
under constraints that a response of the microphone array for one direction or position is to be maintained constant.

20. The method of claim 14, wherein the extracting step extracts the band-pass power distribution for each frequency
bandwidth calculated by the calculating step for the sound source position or direction estimated by the calculating
step directly as the speech parameter.

21. The method of claim 14, wherein the calculating step synthesizes the calculated band-pass power distributions 
for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with respective 
weights, and summing weighted band-pass power distributions. 


22. The method of claim 14, wherein the calculating step estimates the sound source position or direction by detecting
a peak in the synthesized band-pass power distribution and setting a position or direction corresponding to a
detected peak as the sound source position or direction.
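
The synthesis and peak detection recited in claims 21 and 22 can be illustrated with a minimal sketch. It is not taken from the specification: the function names, the example weights, the candidate directions and the power values are all hypothetical, and "detecting a peak" is simplified here to taking the global maximum of the synthesized distribution.

```python
def synthesize(distributions, weights):
    """Weighted sum of per-band power distributions (one list per frequency band)."""
    if len(distributions) != len(weights):
        raise ValueError("one weight per frequency band is required")
    n = len(distributions[0])
    if any(len(d) != n for d in distributions):
        raise ValueError("all bands must cover the same candidate directions")
    return [sum(w * d[i] for d, w in zip(distributions, weights)) for i in range(n)]


def estimate_direction(distributions, weights, directions):
    """Return the direction at the largest synthesized power, plus the synthesized curve."""
    total = synthesize(distributions, weights)
    peak = max(range(len(total)), key=total.__getitem__)  # global maximum (assumption)
    return directions[peak], total


# Hypothetical data: 3 frequency bands evaluated at 5 assumed directions (degrees).
directions = [-60, -30, 0, 30, 60]
band_power = [
    [0.1, 0.2, 0.9, 0.3, 0.1],
    [0.2, 0.1, 0.7, 0.8, 0.1],
    [0.1, 0.1, 0.8, 0.2, 0.2],
]
weights = [1.0, 0.5, 1.0]  # illustrative weights, not values from the patent

direction, total = estimate_direction(band_power, weights, directions)
print(direction)
```

In this example the second band alone would point to 30 degrees, but after weighting and summing the three bands the synthesized distribution has its maximum at 0 degrees, which is the point of combining several frequency bandwidths before estimating the direction.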


Claims (translated from the German text)

1. A microphone array input type speech analysis system, comprising:

a speech input unit (1) for inputting speech in a plurality of channels by using a microphone array
formed by a plurality of microphones; and

a frequency analysis unit (2) for analysing the input speech of each channel inputted by the speech
input unit, and for obtaining band-pass waveforms for each channel, each band-pass waveform being a
waveform for each frequency bandwidth;

characterized by

a sound source position search unit (3) for calculating a band-pass power distribution as a function of the
sound source position or direction for each frequency bandwidth from the band-pass waveforms for each
frequency bandwidth obtained by the frequency analysis unit, for synthesizing calculated band-pass power
distributions for a plurality of frequency bandwidths, and for estimating a sound source position or direction
from a synthesized band-pass power distribution.
2. The system of claim 1, further comprising:

a speech parameter extraction unit (4) for extracting a speech parameter from the band-pass power
distribution for each frequency bandwidth calculated by the sound source position search unit, according
to the sound source position or direction estimated by the sound source position search unit.

3. The system of claim 2, further comprising:

a speech recognition unit (5) for obtaining a speech recognition result by matching the speech parameter
extracted by the speech parameter extraction unit with a recognition dictionary.

4. The system of any one of claims 1, 2 and 3, wherein the sound source position search unit includes:

a low resolution sound source position estimation unit for estimating a rough sound source position or
direction, by minimizing the output power of the microphone array under constraints that responses of the
microphone array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation unit for estimating an accurate sound source position
or direction in a vicinity of the rough sound source position or direction estimated by the low resolution
sound source position estimation unit, by minimizing the output power of the microphone array under
constraints that a response of the microphone array for one direction or position is to be maintained
constant, wherein the speech parameter extraction unit extracts the speech parameter for speech
recognition according to the accurate sound source position or direction.

5. The system of claim 3, wherein the frequency analysis unit obtains the band-pass waveforms for each channel
by using a band-pass filter bank.

6. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by calculating a band-pass power for each frequency bandwidth, in each
one of a plurality of assumed sound source positions or directions within a prescribed search range.

7. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by using a filter function configuration having a plurality of delay
line taps for each channel.

8. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by using a minimum variance method for minimizing the output power of
the microphone array under constraints that the response of the microphone array for one direction or position
is to be maintained constant.

9. The system of claim 3, wherein the speech parameter extraction unit extracts the band-pass power distribution
for each frequency bandwidth, calculated by the sound source position search unit for the sound source position
or direction estimated by the sound source position search unit, directly as the speech parameter.

10. The system of claim 3, wherein the sound source position search unit synthesizes the calculated band-pass
power distributions for a plurality of frequency bandwidths by weighting the calculated band-pass power
distributions with respective weights and summing the weighted band-pass power distributions.

11. The system of claim 3, wherein the sound source position search unit estimates the sound source position or
direction by detecting a peak in the synthesized band-pass power distribution and setting a position or
direction corresponding to a detected peak as the sound source position or direction.

12. A microphone array input type speech analysis method, comprising the steps of:

inputting speech in a plurality of channels by using a microphone array formed by a plurality of
microphones; and

analysing the input speech of each channel inputted by the inputting step, and obtaining band-pass
waveforms for each channel, each band-pass waveform being a waveform for each frequency bandwidth;

characterized by

calculating a band-pass power distribution as a function of the sound source position or direction for each
frequency bandwidth from the band-pass waveforms for each frequency bandwidth obtained by the analysing step,
synthesizing the calculated band-pass power distributions for a plurality of frequency bandwidths, and
estimating a sound source position or direction from a synthesized band-pass power distribution.

13. The method of claim 12, further comprising the step of:

extracting a speech parameter from the band-pass power distribution for each frequency bandwidth
calculated by the calculating step, according to the sound source position or direction estimated by the
calculating step.

14. The method of claim 13, further comprising the step of:

obtaining a speech recognition result by matching the speech parameter extracted by the extracting step
with a recognition dictionary.

15. The method of any one of claims 12, 13 and 14, wherein the calculating step includes the steps of:

a low resolution sound source position estimation step for estimating a rough sound source position or
direction by minimizing the output power of the microphone array under constraints that responses of the
microphone array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation step for estimating an accurate sound source position
or direction in a vicinity of the rough sound source position or direction estimated by the low resolution
sound source position estimation step, by minimizing the output power of the microphone array under
constraints that a response of the microphone array for one direction or position is to be maintained
constant, wherein the extracting step extracts the speech parameter for speech recognition according to
the accurate sound source position or direction.

16. The method of claim 14, wherein the analysing step obtains the band-pass waveforms for each channel by using
a band-pass filter bank.

17. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by calculating a band-pass power for each frequency bandwidth, in each one of a plurality of
assumed sound source positions or directions within a prescribed search range.

18. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by using a filter function configuration having a plurality of delay line taps for each
channel.

19. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by using a minimum variance method for minimizing the output power of the microphone array
under constraints that a response of the microphone array for one direction or position is to be maintained
constant.

20. The method of claim 14, wherein the extracting step extracts the band-pass power distribution for each
frequency bandwidth, calculated by the calculating step for the sound source position or direction estimated by
the calculating step, directly as the speech parameter.

21. The method of claim 14, wherein the calculating step synthesizes the calculated band-pass power distributions
for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with
respective weights and summing the weighted band-pass power distributions.

22. The method of claim 14, wherein the calculating step estimates the sound source position or direction by
detecting a peak in the synthesized band-pass power distribution and setting a position or direction
corresponding to a detected peak as the sound source position or direction.
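
The minimum variance method named in claims 8 and 19 (minimize the array output power while holding the response for one direction constant) can be sketched for a single frequency band. This is an illustrative simplification, not the filter structure of the specification: it assumes a two-microphone array, a narrowband far-field model in place of delay line taps, a sound speed of 340 m/s, and hypothetical names and test values throughout. Under those assumptions the minimized output power for a look direction with steering vector a and covariance R is 1 / (a^H R^-1 a).

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value


def steering(theta_deg, freq_hz, spacing_m):
    """Narrowband steering vector of a 2-microphone array (far-field assumption)."""
    delay = spacing_m * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    return [1.0 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]


def covariance(snapshots, loading=1e-3):
    """Sample covariance of 2-channel complex snapshots, with diagonal loading."""
    n = len(snapshots)
    r = [[0j, 0j], [0j, 0j]]
    for x in snapshots:
        for i in range(2):
            for j in range(2):
                r[i][j] += x[i] * x[j].conjugate() / n
    r[0][0] += loading  # keeps the matrix invertible for noise-free test data
    r[1][1] += loading
    return r


def mv_power(r, a):
    """Minimum variance output power 1 / (a^H R^-1 a) for one look direction."""
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    inv = [[r[1][1] / det, -r[0][1] / det], [-r[1][0] / det, r[0][0] / det]]
    q = sum(a[i].conjugate() * inv[i][j] * a[j] for i in range(2) for j in range(2))
    return 1.0 / q.real


# Hypothetical test signal: one unit-amplitude source at 30 degrees, 1 kHz band,
# microphones 0.1 m apart (less than half a wavelength, so no spatial aliasing).
freq, spacing, true_dir = 1000.0, 0.1, 30
a0 = steering(true_dir, freq, spacing)
snapshots = [[s * a0[0], s * a0[1]] for s in (cmath.exp(0.7j * k) for k in range(64))]

r = covariance(snapshots)
directions = list(range(-90, 91, 5))  # assumed search range, 5 degree steps
powers = [mv_power(r, steering(t, freq, spacing)) for t in directions]
best = directions[max(range(len(powers)), key=powers.__getitem__)]
print(best)
```

Evaluating `mv_power` over a grid of assumed directions gives one band's power distribution as a function of direction; repeating it per band yields the distributions that claims 10 and 21 weight and sum.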

Claims (translated from the French text)

1. A microphone array input type speech analysis system, comprising:

a speech input unit (1) for inputting speeches in a plurality of channels by using a microphone array
which is formed by a plurality of microphones; and

a frequency analysis unit (2) for analysing an input speech of each channel which is inputted by the
speech input unit and for obtaining band-pass waveforms for each channel, each band-pass waveform being a
waveform for each frequency bandwidth,

characterized by:

a sound source position search unit (3) for calculating a band-pass power distribution as a function of a
sound source position or direction for each frequency bandwidth from the band-pass waveforms for each
frequency bandwidth as obtained by the frequency analysis unit, for synthesizing calculated band-pass
power distributions for a plurality of frequency bandwidths, and for estimating a sound source position or
direction from a synthesized band-pass power distribution.


2. The system of claim 1, further comprising:

a speech parameter extraction unit (4) for extracting a speech parameter from the band-pass power
distribution for each frequency bandwidth calculated by the sound source position search unit, according
to the sound source position or direction estimated by the sound source position search unit.

3. The system of claim 2, further comprising:

a speech recognition unit (5) for obtaining a speech recognition result by matching the speech parameter
which is extracted by the speech parameter extraction unit with a recognition dictionary.

4. The system of any one of claims 1, 2 and 3, wherein the sound source position search unit includes:

a low resolution sound source position estimation unit for estimating a rough sound source position or
direction, by minimizing an output power of the microphone array under constraints that responses of the
microphone array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation unit for estimating an accurate sound source position
or direction in a vicinity of the rough sound source position or direction estimated by the low resolution
sound source position estimation unit, by minimizing the output power of the microphone array under
constraints that a response of the microphone array for one direction or position is to be maintained
constant, wherein the speech parameter extraction unit extracts the speech parameter for speech
recognition according to the accurate sound source position or direction.

5. The system of claim 3, wherein the frequency analysis unit obtains the band-pass waveforms for each channel
by using a band-pass filter bank.

6. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by calculating a band-pass power for each frequency bandwidth, for
each one of a plurality of assumed sound source positions or directions within a prescribed search range.

7. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by using a filter function configuration having a plurality of delay
line taps for each channel.

8. The system of claim 3, wherein the sound source position search unit calculates the band-pass power
distribution for each frequency bandwidth by using a minimum variance method for minimizing an output power of
the microphone array under constraints that a response of the microphone array for one direction or position is
to be maintained constant.

9. The system of claim 3, wherein the speech parameter extraction unit extracts the band-pass power distribution
for each frequency bandwidth, calculated by the sound source position search unit for the sound source position
or direction estimated by the sound source position search unit, directly as the speech parameter.

10. The system of claim 3, wherein the sound source position search unit synthesizes the calculated band-pass
power distributions for a plurality of frequency bandwidths by weighting the calculated band-pass power
distributions with respective weights and summing the weighted band-pass power distributions.

11. The system of claim 3, wherein the sound source position search unit estimates the sound source position or
direction by detecting a peak in the synthesized band-pass power distribution and setting a position or
direction which corresponds to a detected peak as the sound source position or direction.

12. A microphone array input type speech analysis method, comprising the steps of:

inputting speeches in a plurality of channels by using a microphone array which is formed by a plurality
of microphones; and

analysing an input speech of each channel which is inputted by the inputting step and obtaining a
band-pass waveform for each channel, each band-pass waveform being a waveform for each frequency
bandwidth,

characterized by:

calculating a band-pass power distribution as a function of a sound source position or direction for each
frequency bandwidth from the band-pass waveforms for each frequency bandwidth as obtained by the analysing
step, synthesizing calculated band-pass power distributions for a plurality of frequency bandwidths, and
estimating a sound source position or direction from a synthesized band-pass power distribution.

13. The method of claim 12, further comprising the step of:

extracting a speech parameter from the band-pass power distribution for each frequency bandwidth
calculated by the calculating step, according to the sound source position or direction which is estimated
by the calculating step.

14. The method of claim 13, further comprising the step of:

obtaining a speech recognition result by matching the speech parameter which is extracted by the
extracting step with a recognition dictionary.

15. The method of any one of claims 12, 13 and 14, wherein the calculating step includes the steps of:

a low resolution sound source position estimation step for estimating a rough sound source position or
direction by minimizing an output power of the microphone array under constraints that responses of the
microphone array for a plurality of directions or positions are to be maintained constant; and

a high resolution sound source position estimation step for estimating an accurate sound source position
or direction in a vicinity of a rough sound source position or direction estimated by the low resolution
sound source position estimation step, by minimizing the output power of the microphone array under
constraints that a response of the microphone array for one direction or position is to be maintained
constant, wherein the extracting step extracts the speech parameter for speech recognition according to
the accurate sound source position or direction.

16. The method of claim 14, wherein the analysing step obtains the band-pass waveforms for each channel by using
a band-pass filter bank.

17. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by calculating a band-pass power for each frequency bandwidth, for each one of a plurality
of assumed sound source positions or directions within a prescribed search range.

18. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by using a filter function configuration having a plurality of delay line taps for each
channel.

19. The method of claim 14, wherein the calculating step calculates the band-pass power distribution for each
frequency bandwidth by using a minimum variance method for minimizing an output power of the microphone array
under constraints that a response of the microphone array for one direction or position is to be maintained
constant.

20. The method of claim 14, wherein the extracting step extracts the band-pass power distribution for each
frequency bandwidth, calculated by the calculating step for the sound source position or direction estimated by
the calculating step, directly as the speech parameter.

21. The method of claim 14, wherein the calculating step synthesizes the calculated band-pass power distributions
for a plurality of frequency bandwidths by weighting the calculated band-pass power distributions with
respective weights and summing the weighted band-pass power distributions.

22. The method of claim 14, wherein the calculating step estimates the sound source position or direction by
detecting a peak in the synthesized band-pass power distribution and setting a position or direction which
corresponds to a detected peak as the sound source position or direction.